Abstract

School  accountability  systems  in  the  United  States  have  been  criticized  on 
a  number  of  fronts,  mainly  on  grounds  of  completeness  and  fairness.  This 
study  examines  an  alternative  school  quality  framework — one  that  seemingly 
responds  to  several  core  critiques  of  present  accountability  systems. 
Examining  results  from  a  pilot  study  in  a  diverse  urban  district,  we  find  that  this 
alternative  system  captures  domains  of  school  quality  that  are  not  reflected  in 
the  current  state  system,  specifically  those  measuring  opportunity  to  learn  and 
socioemotional  factors.  Furthermore,  we  find  a  less  deterministic  relationship 
between  school  quality  and  poverty  under  the  alternative  system.  We  explore 
the  policy  implications  of  these  findings  vis-a-vis  the  future  of  accountability. 

Over the past two decades, policy leaders have established educational measurement
and accountability systems in all 50 states. These systems are
intended to help policymakers identify schools in need of support and intervention,
to inform and empower the public, and to establish clear and consistent
goals for educators and school leaders.

Whatever their successes, however, these systems continue to face a number
of challenges, particularly with regard to capturing the multifaceted
nature  of  school  quality.  Schools  serve  many  purposes  and  advance  multiple 
aims  through  a  variety  of  interconnected  practices  (Schneider,  2017;  Figlio  & 
Loeb,  2011;  Ladd  &  Loeb,  2013;  Rothstein  &  Jacobsen,  2006).  For  political 
and  practical  reasons,  however,  current  data  systems  have  generally  focused 
on a set of fairly basic outputs, chief among them student standardized
test scores in math and English. As a result, these measurement and
accountability  systems  have  been  roundly  criticized — for  failing  to  capture 
the  full  picture  of  school  quality,  for  relying  too  heavily  on  measures  linked 
to student demography, and for producing a range of unintended consequences
such as curricular narrowing and teaching-to-the-test (e.g., ASCD,
2014;  Cowley,  2006;  Darling-Hammond,  2004;  National  Education 
Association  [NEA],  2011;  Spalding,  2014). 

Our study examines the pilot year of a holistic school quality measurement
system built for a mid-sized, highly diverse urban district in
Massachusetts.  Constructed  with  stakeholder  input  and  responding  to  several 
common  criticisms  of  present  efforts  to  measure  school  quality,  this  system 
represents  a  model  of  what  accountability  systems  of  the  future  might  look 
like.  Insofar  as  that  is  the  case,  analysis  of  it  may  help  answer  some  of  the  key 
questions  policymakers  face  as  they  revisit  state-level  measurement  and 
accountability  systems  under  the  Every  Student  Succeeds  Act  (ESSA). 

Ultimately,  our  analyses  reveal  that  although  there  is  some  correlation 
between  the  current  state  accountability  ratings  and  the  ratings  generated  by  a 
more comprehensive set of data, the inclusion of additional measures can dramatically
alter the overall interpretation of school quality. As we find, measurement
and accountability systems that draw mainly on achievement scores
in  math  and  English  incompletely  capture  school  performance,  strongly  reflect 
demographic  variables,  and  consequently  may  foster  the  mistaken  view  that 
school  quality  is  a  uniform  concept.  In  short,  although  a  more  holistic  approach 
to  measurement  does  not  represent  a  perfect  solution,  it  does  appear  to  offer  a 
much  more  viable  basis  for  future  accountability  systems. 

Literature  Review 

Current  Data  Collection  and  Reporting  Practices 

No Child Left Behind (NCLB) dramatically expanded the amount of information
collected by states about school and district performance, using student
outcomes to construct accountability systems with meaningful
consequences. Specifically, the law required states to collect and report data
on achievement and teacher quality, and to disaggregate data by student
subgroups (U.S. Department of Education [USED], 2013). Over the next decade,
schools  and  districts  bristled  under  NCLB,  and  scholars  highlighted  a  broad 
range  of  flaws  in  the  law.  Ultimately,  the  USED  responded  by  initiating  a 
waiver  process  that  led  to  significant  changes  in  accountability  in  many 
states,  such  as  the  elimination  of  requirements  surrounding  Adequate  Yearly 
Progress  and  the  goal  of  100%  proficiency.  Although  waivers  were  granted  to 
many  states,  they  largely  served  to  limit  and  structure  data  collection  and 
reporting,  rather  than  to  expand  or  reimagine  those  enterprises.  Consequently, 
despite  other  significant  variance  across  states  with  regard  to  education,  data 
collection  and  reporting  practices  continued  to  display  striking  similarity.  As 
Mikulecky  and  Christie  reported  in  2014,  all  states  incorporated  student 
achievement  and  graduation  rates  into  school  accountability  systems,  with  a 
majority including student growth, gap closure, and proxies for postsecondary
and career readiness. This largely remains the case under ESSA, which
replaced NCLB in late 2015, though states continue to revise their accountability
frameworks in preparation for the 2017-2018 school year, when the
law will go into full effect.

State  measurement  and  accountability  systems  do,  of  course,  include  more 
than  just  test  scores  in  math  and  English.  Still,  these  systems  largely  fail  to 
address  the  full  range  of  what  schools  do.  Several  core  subjects,  for  instance, 
go untested, and therefore unmeasured. Current measurement and accountability
systems largely ignore aspects of student physical, social, and emotional
health emphasized by schools (Schneider, 2017; Downey, von Hippel,
& Hughes, 2008; Mintrop & Sunderman, 2009). And, these systems largely
fail  to  provide  information  that  might  lead  to  meaningful  improvements  in 
curriculum,  teacher  preparation,  and  school  resources  (Darling-Hammond, 
2007).  Moreover,  research  suggests  that  various  elements  of  school  quality 
are  not  intrinsically  aligned,  indicating  that  a  measurement  system  designed 
to  capture  some  elements  of  school  quality  will  not  necessarily  capture  others 
(e.g.,  Rumberger  &  Palardy,  2005). 

Current measurement and accountability systems have also been criticized
for measuring factors that are largely outside the control of schools.
As  research  has  revealed,  student  standardized  test  scores  tend  to  correlate 
strongly  with  student  demographic  characteristics  (Davis-Kean,  2005; 
Reardon, 2011), and school rankings tend to correlate strongly with school-level
poverty (Spalding, 2014). Consequently, test-based outcome variables
may  tell  stakeholders  less  about  school  performance  than  about  families 
and  neighborhoods.  Insofar  as  that  is  the  case,  such  data  offer  little  in  the 
way  of  actionable  information,  and  may  unnecessarily  stigmatize  schools 
with  diverse  student  bodies. 

Finally, because measurement and accountability systems shape school
and district goals and activities, researchers have also explored the degree to
which narrow conceptions of school quality have produced troubling unintended
consequences. As scholars have shown, current systems have led to
narrowing  of  the  curriculum  and  an  increased  emphasis  on  test  preparation 
(Dee,  Jacob,  &  Schwartz,  2013;  Hamilton  et  al.,  2007;  Jennings  &  Bearak, 
2014).  In  addition,  such  systems  have  fostered  an  environment  in  which 
teachers  are  less  satisfied  with  their  jobs  (Markow  &  Pieters,  2012),  and  in 
which  students  exhibit  higher  levels  of  stress  (Segool,  Carlson,  Goforth,  von 
der  Embse,  &  Barterian,  2013). 

Measurement  and  accountability  systems  remain  in  place  as  cornerstones 
of  governance,  and  ESSA,  like  NCLB  before  it,  continues  to  place  significant 
weight  on  achievement  scores  and  graduation  rates  in  school  performance 
rankings. That said, there are two primary reasons to suspect that data systems
will change under the new law. First, ESSA stipulates that states must
use an additional metric in tracking student success, such as student
engagement, school climate, or access to advanced coursework. Second,
the additional flexibility in ESSA, both real and perceived, is likely to spur
reforms in areas that have been unpopular with educators or that have triggered
scholarly criticism. Influential groups such as ASCD and the NEA have
strongly  advocated  for  the  inclusion  of  multiple  measures  in  determinations 
of  school  success  (ASCD,  2014;  NEA,  2011),  as  have  prominent  academics 
(e.g.,  Darling-Hammond  et  al.,  2016). 

Emerging  Efforts  to  Measure  School  Quality  More 
Comprehensively 

What  other  dimensions  of  school  quality,  beyond  achievement  results  in  math 
and  English,  might  be  measured?  And  what  is  the  relationship  between  those 
new  measures  and  the  test  scores  that  presently  dominate  state  accountability 
systems? 

Much discussion has revolved around Opportunity to Learn (OTL) measures,
which are presumably less closely tied to student demography and more
informative about what is going on inside schools. Model OTL frameworks,
like that of the National Council of Teachers of English, tend to emphasize
school  culture,  teaching  environment,  learning  resources,  and  resources  from 
the  community  (National  Council  of  Teachers  of  English,  2012). 

There has also been a great deal of discussion about expanding measurement
to include Social and Emotional Learning (SEL). For the past two
decades, groups such as the Collaborative for Academic, Social, and
Emotional Learning (CASEL) have been advocating for a greater emphasis
on SEL, as have organizations such as the National Association of State
Boards  of  Education  (NASBE,  2013).  Presently,  three  states — Illinois, 
Kansas,  and  Pennsylvania — have  adopted  comprehensive  SEL  standards 
with  developmental  benchmarks. 

Research  has  demonstrated  a  connection  between  OTL  and  SEL  variables 
on  one  hand,  and  student  standardized  test  scores  on  the  other.  In  a  study  of 
Chicago schools, Erbe (2000) found that “focus on learning,” “school commitment,”
and “parental involvement” had roughly .5 to .7 correlations with
math  achievement.  Some  studies  have  also  shown  that  specific  school 
resources  (e.g.,  funds  used  to  support  targeted  instruction)  have  an  effect  on 
achievement (Archibald, 2006; Lavy, 2012), though such findings are not
consistent across the scholarly literature (e.g., Hanushek, 1997, 2003;
Houtenville & Conway, 2008). Numerous other studies document the connections
between various measures of school quality and outcome measures
of  student  achievement  (Berkowitz,  Moore,  Astor,  &  Benbenishty,  2017; 
Cadima,  Peixoto,  &  Leal,  2014;  Darling-Hammond,  2000;  Kutsyuruba, 
Klinger,  &  Hussain,  2015;  Lubienski,  Lubienski,  &  Crane,  2008). 

Still,  it  seems  unwise  to  validate  various  measures  of  school  quality  solely 
by  establishing  relationships  with  standardized  achievement  scores.  Perhaps 
the  most  compelling  argument  against  such  a  practice  is  the  fact  that  different 
domains  of  school  quality  may  be  orthogonal,  which  is  to  say  that  successes 
in  some  areas  may  not  coincide  with  successes  in  others.  The  most  relevant 
research  in  this  area  comes  from  investigations  into  teacher  effectiveness. 
Jackson  (2016),  for  instance,  found  that  teachers  exhibit  variability  in  their 
effect  on  student  behaviors,  including  suspensions,  attendance,  course  grades, 
and  on-time  grade  progression,  as  well  as  longer  term  outcomes  such  as  high 
school completion. Furthermore, these teacher effects on nontest score outcomes
exhibit only weak positive correlation (ρ = .16) with a teacher’s value
added  to  standardized  achievement.  A  comparable  study,  which  uses  data 
from  more  than  1  million  students  in  the  Los  Angeles  Unified  School  District, 
reaches  similar  conclusions  about  the  multidimensionality  of  teachers  (Petek 
&  Pope,  2016).  In  other  words,  teachers  can  be  relatively  strong  in  raising 
student  achievement  without  improving  student  behavioral  outcomes,  and 
vice versa. These and other studies (e.g., Grissom, Loeb, & Doss, 2015) provide
compelling evidence that teacher effectiveness is not a unidimensional
construct,  which  in  turn  suggests  that  school  quality  is  not  either. 

A  number  of  districts  are  currently  employing  school  quality  frameworks 
(SQFs)  that  extend  beyond  academic  achievement  by  including  measures  of 
OTL  and/or  SEL.  The  Chicago  Public  Schools,  for  instance,  have  worked 
with  the  University  of  Chicago’s  Consortium  on  School  Research  to  employ 
the 5Essentials framework. This framework measures the effectiveness of
school leaders in implementing a clear and strategic vision, the level of support
for  teachers,  the  involvement  of  families,  the  safety  and  orderliness  of  the 
school,  and  the  level  of  academic  challenge  in  classes.  Recently,  the  entire 
state  of  Illinois  adopted  the  framework,  making  the  case  that  “test  scores 
alone do not provide a full picture of teaching and learning in any one school”
(5Essentials, n.d.). The 5Essentials survey is taken by all pre-K through 12th-grade
teachers, as well as by all sixth- to 12th-grade students in Illinois, and
reports  generated  from  these  surveys  are  produced  for  all  schools  in  the  state. 

Similar  work  is  currently  being  done  by  the  California  Office  to  Reform 
Education  (CORE) — a  consortium  of  districts  that  collectively  educate  more 
than 1 million students in the state. CORE’s School Quality Improvement
Index  is  built  around  a  100-point  scale:  60  points  allotted  for  the  academic 
domain,  and  40  for  social-emotional  and  school  culture  factors.  Within  the 
academic  domain,  two  thirds  of  points  are  determined  by  test  scores,  with  raw 
scores and growth scores counting equally. The remaining third of the academic
domain is determined by graduation rates. For the 40 points allotted to
social-emotional and school culture factors, the CORE districts rely on a
broader range of measures, including how many students are missing significant
amounts of school, how many are suspended or expelled, and how many
English language learners have become fluent. In addition, the CORE districts
plan to incorporate results from school climate surveys given to students, parents,
and teachers, a practice that is increasingly supported by research (e.g.,
Kane  &  Staiger,  2012;  Wilkerson,  Manatt,  Rogers,  &  Maughan,  2000). 
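
The point allocation described above can be worked through as simple arithmetic. The following is a minimal sketch that only restates the proportions given in the text; the variable names are illustrative, and the division of the 40 social-emotional and school culture points among individual measures is not specified here, so it is left as a single figure.

```python
from fractions import Fraction

TOTAL_POINTS = 100
ACADEMIC_POINTS = 60   # academic domain
CULTURE_POINTS = 40    # social-emotional and school culture factors

# Two thirds of the academic domain come from test scores,
# split equally between raw scores and growth scores.
test_points = Fraction(2, 3) * ACADEMIC_POINTS
raw_points = test_points / 2
growth_points = test_points / 2

# The remaining third of the academic domain is graduation rates.
graduation_points = ACADEMIC_POINTS - test_points

breakdown = {
    "raw test scores": int(raw_points),
    "growth scores": int(growth_points),
    "graduation rates": int(graduation_points),
    "social-emotional and culture": CULTURE_POINTS,
}

for component, points in breakdown.items():
    print(f"{component}: {points} of {TOTAL_POINTS} points")
```

Under these proportions, raw test scores, growth scores, and graduation rates each account for 20 of the 100 points.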

Given  criticism  from  scholars  and  the  public,  as  well  as  new  flexibility 
afforded  by  ESSA  legislation,  it  appears  that  state-level  measurement  and 
accountability  systems  will  expand  in  coming  years.  As  they  do,  it  seems 
likely  that  they  will  include  many  of  the  input  measures  that  fall  under  the 
umbrella  of  OTL,  and  also  many  of  the  outcome  measures  included  in  SEL 
frameworks. 

Current  Study 

Study  Population 

This  project  took  place  in  a  diverse,  mid-sized  urban  district  in  the  state  of 
Massachusetts.  In  the  2014-2015  school  year,  nearly  half  of  students  in  the 
district were Hispanic, roughly a third were White, and the remaining students
were mostly Asian and African American (Massachusetts Department
of Elementary and Secondary Education, 2016). The district had nearly 5,000
students spread out across its one early learning center, seven traditional primary
schools, two alternative schools, and one secondary school. The district
serves a fairly high-need population, with more than one third of students
being  deemed  economically  disadvantaged  (ED)  and  nearly  one  in  five  being 
categorized  as  an  English  language  learner. 

The  “Beyond  Test  Scores”  Project 

In  the  spring  of  2014,  our  research  team  partnered  with  the  district,  which 
was  interested  in  measuring  school  quality  “beyond  test  scores.”  District 
administrators,  the  city’s  school  committee,  and  other  civic  leaders  had 
expressed  frustration  with  a  narrow  range  of  measures,  which  they  argued 
had  limited  their  ability  to  track  progress  across  many  aims,  and  which 
appeared  to  correlate  strongly  with  student  socioeconomic  status.  A  number 
of stakeholder groups expressed their particular interest in expanding the current
conception of school quality to include OTL and SEL measures.

Our team began by compiling an inventory of school quality factors distilled
from national polling, educational research, and community surveys.
This generated a list of several dozen potential variables. Some of these variables
repeated each other, differing primarily in the language used to express
them.  In  those  cases,  we  simply  selected  the  factor  with  the  clearest  wording. 
In  addition,  many  factors  on  the  list  seemed  to  be  of  different  grain  size — 
some,  for  instance,  were  quite  specific,  while  others  seemed  to  be  umbrella 
concepts  for  multiple  factors.  In  those  cases,  we  retained  the  smaller,  specific 
items,  and  set  aside  the  umbrella  terminology  for  later  in  the  process. 

Having distilled 32 separate factors for an SQF, we organized them into a
hierarchical taxonomy. In doing so, we paired together similar metrics, such
as “student sense of belonging” and “student-teacher relationships,” into
measures (in this case, “Relationships”). Next, we nested our 16 subcategories
under five major categories. The “Relationships” measure, for instance,
together with “Safety” and “Academic Orientation,” formed a major category:
“School Culture.” This approach allowed us to preserve a high level of
complexity,  while  also  respecting  the  limits  of  working  memory  (Baddeley, 
1992;  Cowan,  2001). 

Throughout  this  process,  we  conducted  focus  groups  with  stakeholders 
in  the  community.  Ultimately,  we  conducted  10  focus  groups — three  with 
teachers,  two  with  principals  and  administrators,  and  five  with  parents  and 
community  members.  Although  educators,  administrators,  and  laypeople 
have  different  priorities  and  concerns,  these  different  constituencies  were 
able to agree on a single framework. As one of our research assistants concluded
in a memo analyzing results from focus groups with principals and
community members: “There was virtually no disagreement between the
two groups.” We found similarly strong overlap with results from our
focus groups with teachers. No longer hearing new suggestions from our
stakeholders — a  point  of  “saturation”  (Morse,  Barrett,  Mayan,  Olson,  & 
Spiers,  2002)  in  our  sampling — and  seeing  no  major  disagreements  among 
them,  we  sent  a  copy  of  the  framework  to  district  leaders  for  their  approval 
and adoption. Note that the information gathered in these focus groups was
used to engage and inform stakeholders in the construction and validation
of the alternative framework; the focus groups did not serve as a collection
point for the school quality data that are analyzed in this study. See the appendix for
a copy of the SQF, which outlines all metrics that are hierarchically combined
to construct the framework. For those who wish to understand the
SQF  in  greater  detail,  see  Schneider  (2017)  for  information  regarding  the 
creation  of  the  SQF  and  the  use  of  focus  groups  in  that  process. 

Our purpose in this article is not to make claims about the effectiveness
or generalizability of this framework. Instead, we have provided an
overview of its design to show that it is an adequate test case for the state-level
measurement and accountability systems that will emerge in coming
years. It includes many of the most commonly discussed OTL measures, as
well as a number of SEL measures. In addition, its development incorporated
the feedback of multiple stakeholder groups, and to a large degree
addresses  the  public  concerns  that  led  to  changes  in  federal  law  pertaining 
to  school  accountability. 


Data  and  Method 

There are three primary sources of data used in this study: a survey of students
in the district, a survey of teachers in the district, and a collection of
administrative  data  made  available  by  the  district.  The  unit  of  analysis  in  this 
study  is  the  school.  The  sample  includes  seven  elementary/middle  schools, 
five  of  which  are  pre-K-8,  one  of  which  is  K-8,  and  one  of  which  is  K-6.  To 
ensure  comparability,  the  only  high  school  in  the  district  was  excluded  from 
analysis,  as  were  an  early  childhood  center  and  two  small  alternative  schools. 
These  excluded  schools  serve  unique  populations,  and  therefore  do  not  lend 
themselves  to  norm-based  comparison  with  the  included  schools. 

Students  in  Grades  4  and  above  at  each  of  the  elementary/middle  schools 
were  issued  perception  surveys,  with  students  in  Grades  3  and  below  being 
excluded  due  to  concerns  of  age  appropriateness,  specifically  with  regard  to 
reading comprehension level. The survey produced a student sample of
1,607 students, or 98.2% of the nonexcluded population. The teacher survey was
issued  to  and  completed  by  all  229  teachers  within  the  sample  schools. 
Administrative  data  were  collected  at  the  end  of  the  academic  year  for  all 
schools  in  the  sample. 

Surveys were constructed using established scales when available. The
internal consistency of all survey scales was examined using Cronbach’s
alpha. To examine the relationship between
data sources, all metrics were normalized to have a mean of 0 and a standard
deviation of 1.0. Thus, the analytic approach used in this study
includes district-normed measures. Consequently, school quality is calculated
in comparison with other schools within the district analytic sample.
In  other  words,  a  zero-sum  approach  is  taken,  where  a  school’s  quality  is 
judged  with  respect  to  how  the  other  schools  in  the  district  perform  within 
a  given  metric.1 
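
The normalization described above can be sketched as a standard z-score transformation. This is a minimal illustration, not the study's actual procedure: the class-size values are hypothetical, the function name is ours, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, as is the choice to standardize at the school level.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to have a mean of 0 and a standard deviation of 1."""
    center = mean(values)
    spread = pstdev(values)  # assumption: population SD across the sample
    if spread == 0:
        raise ValueError("a metric with no variation cannot be standardized")
    return [(value - center) / spread for value in values]


# Hypothetical average class sizes for seven schools (not the study's data).
class_size = [18.0, 21.5, 19.0, 24.0, 20.5, 22.0, 23.0]
class_size_z = z_scores(class_size)

for raw, z in zip(class_size, class_size_z):
    print(f"class size {raw:5.1f} -> z = {z:+.2f}")
```

Because each score is defined relative to the other schools in the sample, a school's z score can change when another school's raw value changes, which is the zero-sum property noted in the text.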

We  examine  the  correlations  of  metrics  within  the  five  major  categories  of 
school  quality  in  our  framework.  Note  that  all  correlations  presented  in  this 
article  are  at  the  school  level,  which  aligns  with  our  interest  in  understanding 
how  factors  of  school  quality  relate.  We  then  sum  all  metrics  within  a  given 
school  measure  to  form  a  single  rating  for  that  school  quality  measure — first 
at  the  submeasure  level,  then  at  the  measure  level,  and  finally  at  the  major 
category  level. 
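
The bottom-up roll-up described above can be sketched as a recursive aggregation over a nested framework. The metric codes below (3Bif, 3Biia, 3Biib) appear in Table 1, but the z scores are hypothetical and the grouping under "3Bi" and "3Bii" is inferred from the coding scheme. The text describes summing; this sketch averages at each level instead, which is our assumption, because plain sums applied level by level would give the same result as a single flat sum.

```python
from statistics import mean


def aggregate(node, scores):
    """Score a node: a metric code (str), a list of children, or a dict of children."""
    if isinstance(node, str):
        return scores[node]
    children = node.values() if isinstance(node, dict) else node
    return mean(aggregate(child, scores) for child in children)


# One measure with two submeasures; the second combines two class-size metrics.
measure_3b = {
    "3Bi": ["3Bif"],
    "3Bii": ["3Biia", "3Biib"],
}

# Hypothetical z scores for a single school (not the study's data).
school_scores = {"3Bif": -0.3, "3Biia": 0.9, "3Biib": 0.3}

nested_score = aggregate(measure_3b, school_scores)
flat_score = mean(school_scores.values())

print(f"level-by-level: {nested_score:.2f}, flat average: {flat_score:.2f}")
```

In this example the two class-size metrics are first combined into one submeasure, so class size counts once rather than twice at the measure level; the nested and flat results differ for that reason.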

Next, we examine the correlations between major school quality categories.
We generally expect to see positive correlation, as schools that perform
well  in  one  domain  might  be  expected  to  succeed  in  others  as  well.  However, 
we  do  not  anticipate  strong  correlations,  as  a  general  rule,  as  we  also  expect 
that  schools  will  exhibit  relative  strengths  and  weaknesses.  Again,  there  are 
theoretical  reasons  to  suspect  that  different  measures  of  school  quality  are 
orthogonal  to  various  degrees,  so  one  would  expect  commensurate  empirical 
variability  as  well.  We  then  form  a  composite  (i.e.,  overall)  SQF  ranking  by 
summing  combined  z  scores  over  all  categories.  This  allows  us  to  examine 
how school rankings might differ across the alternative and traditional frameworks.
We then dive deeper into these relationships, reporting the average
school-level  correlations  between  SQF  metrics  and  the  state’s  Progress  and 
Performance  Index  (PPI),  which  is  currently  used  by  the  Department  of 
Elementary  and  Secondary  Education  to  rate  schools,  and  which  is  discussed 
in  greater  detail  later  in  the  article.2 
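
The composite ranking step can be illustrated with a short sketch: category z scores are summed per school, and the resulting order is compared with the order implied by the state index. All school names and values here are hypothetical, and the helper name `rank_schools` is ours.

```python
def rank_schools(scores):
    """Return school names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical z scores for the five major categories (not the study's data).
category_z = {
    "School A": [0.4, 0.1, -0.2, 0.9, 0.3],
    "School B": [-0.3, 0.5, 0.6, -0.8, 0.2],
    "School C": [0.2, -0.6, -0.1, 0.5, -0.4],
    "School D": [-0.3, 0.0, -0.3, -0.6, -0.1],
}

# Hypothetical state index values for the same schools.
ppi = {"School A": 78, "School B": 55, "School C": 81, "School D": 60}

# Composite SQF score: the sum of a school's category z scores.
composite = {school: sum(z) for school, z in category_z.items()}

sqf_order = rank_schools(composite)
ppi_order = rank_schools(ppi)

for school in sqf_order:
    sqf_rank = sqf_order.index(school) + 1
    ppi_rank = ppi_order.index(school) + 1
    print(f"{school}: SQF rank {sqf_rank}, PPI rank {ppi_rank}")
```

With these made-up numbers, the school ranked first on the index falls to third on the composite, which is the kind of shift the comparison is designed to surface.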

Finally,  we  examine  the  relationship  between  each  system  (the  existing 
state accountability system and the alternative SQF system) and school poverty.
To do so, we calculate the average school-level correlations between the
state  PPI  system  and  school  poverty,  using  the  percentage  of  ED  students  in 
each  school.  We  then  perform  the  same  calculation  for  the  SQF  metrics, 
reporting  average  correlations  for  each  of  the  major  categories  in  the  SQF  to 
illustrate  which  domains  of  the  alternative  framework  mirror  school  poverty, 
and  which  do  not.  Through  these  analyses,  we  address  two  primary  research 
questions: 

Research Question 1: To what extent does a more holistic array of school
quality data capture information not reflected in current accountability
systems? Specifically, what are the correlations between SQF metrics and
the Massachusetts PPI, and how might school rankings differ under the
two frameworks?

Research Question 2: How might existing data systems reflect the out-of-school
context tied to student demography, rather than something about
school quality? Specifically, what are the correlations between SQF metrics
and school poverty, and how do those compare with the correlation
between PPI and school poverty?
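
The poverty comparison behind Research Question 2 can be sketched as follows: one correlation between the state index and the percentage of ED students, and, for each SQF category, the average of the metric-level correlations with that same percentage. Every number below is hypothetical, and the `pearson` helper is our own implementation of the standard Pearson coefficient.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of school values."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data for seven schools (not the study's values).
pct_ed = [22.0, 31.0, 38.0, 45.0, 52.0, 60.0, 67.0]  # % economically disadvantaged
ppi = [88.0, 80.0, 79.0, 70.0, 62.0, 60.0, 51.0]      # state index score

# Hypothetical school-level z scores, grouped by SQF category.
sqf_metrics = {
    "School Culture": [
        [0.3, -0.2, 0.1, 0.4, -0.3, 0.0, -0.1],
        [-0.1, 0.2, -0.4, 0.3, 0.1, -0.2, 0.2],
    ],
    "Resources": [
        [0.5, 0.1, 0.3, -0.2, 0.0, -0.4, -0.3],
        [0.2, 0.4, -0.1, 0.1, -0.3, -0.1, -0.2],
    ],
}

ppi_r = pearson(ppi, pct_ed)
category_r = {
    category: mean(pearson(metric, pct_ed) for metric in metrics)
    for category, metrics in sqf_metrics.items()
}

print(f"PPI vs. poverty: r = {ppi_r:+.2f}")
for category, r in category_r.items():
    print(f"{category} vs. poverty: average r = {r:+.2f}")
```

With only seven schools, any such coefficient is sensitive to a single school's values, so the comparison is best read as descriptive.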

Findings 

Scrutinizing  the  Alternative  Framework 

Before  conducting  the  analyses  that  directly  address  our  research  questions, 
we examined the measures employed in the alternative framework to confirm
that they met basic standards. We performed factor loading for the 22
survey-based metrics, calculating the internal consistency of each
using Cronbach’s alpha. We found that 20 metrics exceeded .7, long held as
a rule of thumb in scale reliability (Nunnally, 1978), with the remaining two survey
scales exhibiting reliability estimates of .69 and .58. Two survey-based
metrics — Arts  Exposure  (5Cia)  and  Physical  Activity  (5Diia) — were  based 
on  a  single  survey  question,  and  therefore  did  not  have  associated  reliability 
statistics. We consider these results to be sufficient to conduct the subsequent
analyses necessary for this study.
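
For readers unfamiliar with the reliability statistic used above, the following minimal sketch computes Cronbach's alpha for one survey scale using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The responses are hypothetical, and the .7 threshold is the rule of thumb cited in the text.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    items = list(zip(*responses))  # regroup scores by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical answers from six respondents to a four-item, 5-point scale.
responses = [
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [1, 2, 2, 1],
    [4, 4, 5, 4],
]

alpha = cronbach_alpha(responses)
meets_rule_of_thumb = alpha > 0.7
print(f"alpha = {alpha:.2f}, exceeds .7: {meets_rule_of_thumb}")
```

A scale built from a single question has no item variances to compare, which is why single-item metrics have no associated reliability statistic.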

Schools  exhibited  considerable  variation  in  most  metrics  (see  Table  1),  as 
indicated  by  standard  deviations  between  0.16  and  1.2  for  teacher  survey 
metrics, and 0.09 and 0.46 for student survey metrics, on a 5-point Likert-type
scale. The remaining six metrics, which were taken from district administrative
records, also displayed meaningful variation in our sample. Given
the norm-referenced approach to this study, such variation is a necessary precondition
for the remaining analyses.

To illustrate the relationship between the metrics that form school quality
categories, we take the within-category average of metric-level correlations.
The average within-category correlations are shown in Table 2,
and range from .15 (Category 3, Resources) to .45 (Category 2, School
Culture). These weak-to-moderate average correlation magnitudes generally
support the grouping of such metrics to form school quality categories,
as they show a positive, but not deterministic, relationship between
these related measures.
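
The within-category average can be sketched as the mean of all pairwise correlations among a category's metrics. The three metric codes below are School Culture metrics listed in Table 1, but the school-level values are hypothetical, and averaging over every unordered pair is our reading of the procedure rather than a documented detail.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of school values."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def average_pairwise_r(metrics):
    """Mean correlation over every unordered pair of metrics in a category."""
    pairs = combinations(metrics.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)


# Hypothetical z scores for seven schools (not the study's data).
school_culture = {
    "2Aib": [0.3, -0.2, 0.1, 0.4, -0.3, 0.0, -0.1],
    "2Biia": [0.2, -0.1, -0.2, 0.2, -0.2, 0.1, 0.0],
    "2Ciia": [0.1, 0.0, 0.3, 0.2, -0.2, -0.2, 0.1],
}

within_category_r = average_pairwise_r(school_culture)
print(f"average within-category correlation: {within_category_r:+.2f}")
```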
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Table  I.  Descriptive  Statistics,  Standardized  School  Quality  Metrics. 


Source 

Average 

school 

z  score 

Minimum 

school 

z  score 

Maximum 

school 

z  score 

Standard 

deviation 

Professional  Preparation  Scale  (lAic) 

Teacher  survey 

0.03 

-0.34 

0.42 

0.26 

Pedagogical  Effectiveness  Scale  (lAiia) 

Student  survey 

0.02 

-0.25 

0.28 

0.17 

Interest  in  Students  Scale  (lAiiia) 

Student  survey 

-0.01 

-0.25 

0.25 

0.21 

T eacher  T urnover  ( 1  Bia) 

Administrative  data 

0.00 

-1.28 

1.29 

1.00 

PD  Scale  (IBiib) 

Teacher  survey 

-0.03 

-0.39 

0.54 

0.30 

Teacher  Principal  Trust  Scale  (IBiiia) 

Teacher  survey 

0.02 

-0.84 

0.59 

0.44 

Principal  Instructional  Leadership 

Teacher  survey 

-0.02 

-0.50 

0.59 

0.39 

Scale  (1  Biiib) 

Student  Safety  Scale  (2Aib) 

Student  survey 

0.14 

-0.29 

0.46 

0.24 

Peer  Victimization  Scale  (2Aiib) 

Teacher  survey 

0.09 

-0.39 

1.20 

0.55 

Peer  Support  Scale  (2Aiic) 

Teacher  survey 

0.03 

-0.89 

0.80 

0.54 

Sense  of  Belonging  Scale  (2bia) 

Teacher  survey 

0.01 

-0.18 

0.16 

0.1  1 

Student  Teacher  Relationship  Scale 

Student  survey 

0.01 

-0.25 

0.23 

0.18 

(2Biia) 

Chronic  Absences  (2Cia) 

Administrative  data 

0.00 

-1.19 

1.07 

1.00 

Academic  Press  Scale  (2Ciia) 

Student  survey 

0.02 

-0.23 

0.36 

0.20 

Art  Classes  per  Student  (3Aiia) 

Administrative  Data 

0.00 

-1.00 

1.57 

0.91 

Counselors  per  Students  (3 Aiib) 

Administrative  Data 

0.00 

-1.15 

1.15 

1.00 

Support  Staff  Scale  (3Aiid) 

Teacher  survey 

-0.03 

-0.36 

0.30 

0.28 

Curricular  Strength  Scale  (3Bif) 

Teacher  survey 

-0.05 

-0.37 

0.30 

0.22 

Class  Size  (3Biia) 

Administrative  data 

0.00 

-1.35 

1.91 

1.00 

Class  Size  Scale  (3Biib) 

Teacher  survey 

-0.01 

-0.41 

0.44 

0.33 

Parental  Engagement  Scale  (3Cia) 

Teacher  survey 

0.04 

-0.83 

0.79 

0.53 

Community  Engagement  Scale  (3Ciia) 

Teacher  survey 

0.00 

-0.70 

0.41 

0.43 

State  SGP  Score  (4Aia) 

Administrative  data 

0.00 

-1.69 

1.22 

1.00 

Student  Achievement  Scale  (4Aiia) 

Teacher  survey 

0.06 

-0.56 

1.05 

0.51 

Student  Engagement  Scale  (4Bia) 

Student  survey 

0.01 

-0.24 

0.30 

0.20 

Valuing  Learning  Scale  (4Biia) 

Teacher  survey 

0.00 

-0.18 

0.19 

0.12 

Problem-Solving  Scale  (4Cia) 

Teacher  survey 

0.02 

-0.35 

0.59 

0.32 

Appreciation  for  Diversity  Scale 

Student  survey 

0.02 

-0.1 1 

0.24 

0.13 

(5Aiia) 

Grit  Scale  (5Bia) 

Student  survey 

0.06 

-0.17 

0.27 

0.15 

Arts  Exposure  (5Cia) 

Teacher  survey 

-0.02 

-0.18 

0.09 

0.10 

Positive  Affect  Scale  (5Dia) 

Student  survey 

0.02 

-0.30 

0.29 

0.19 

Physical  Activity  (5Diia) 

Teacher  survey 

-0.01 

-0.28 

0.32 

0.23 
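
As a rough illustration of the within-category averaging described before Table 1, the sketch below averages the Pearson correlation over every distinct pair of metrics in one category. The pairwise scheme is an assumption about how the "within-category average" was formed, and the function names and z scores are hypothetical, not study data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_within_category_correlation(metrics):
    """Mean Pearson correlation over all distinct pairs of metrics (assumed scheme)."""
    pairs = list(combinations(metrics.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Hypothetical school-level z scores: three metrics observed across five schools.
toy_category = {
    "metric_a": [0.1, -0.2, 0.3, 0.0, -0.1],
    "metric_b": [0.2, -0.1, 0.1, 0.1, -0.3],
    "metric_c": [-0.1, 0.0, 0.2, -0.2, 0.1],
}

avg_r = average_within_category_correlation(toy_category)
print(f"average within-category correlation: {avg_r:.2f}")
```

With only a handful of schools, each pairwise coefficient is noisy, which is one reason to read such averages as descriptive rather than inferential.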

Table 2. Average Within-Category Correlations of School Quality Metrics.

Major school quality category                  Average within-category
                                               metric correlation
Teachers and the Teaching Environment (1)      .33
School Culture (2)                             .45
Resources (3)                                  .15
Academic Learning (4)                          .17
Character and Well-Being (5)                   .17

We next analyze the relationships between, as opposed to within, the five
major categories of school quality in the SQF. To do so, we aggregate all
normalized metrics to form a single school quality score for each category. In
an effort to evenly weight categories, we aggregate from lower levels upward.
For example, by combining the Class Size ratio metric (3Biia) with the Class
Size Scale metric (3Biib), we form a single submeasure: Class Size (3Bii).
Then, we combine Curricular Strength (3Bi) with Class Size (3Bii) to form
a single measure: Curricular Resources (3B). Finally, we combine relevant
measures to create a major category score; in this case, we combine Facilities
and Personnel (3A), Curricular Resources (3B), and Community Support
(3C) to form Resources (3). Table 3 shows the Pearson correlation coefficients
at the category level. All correlations shown in Table 3 are positive,
with magnitudes varying from .18 to .70. Overall, these findings suggest that
categories used to construct the framework exhibit meaningful associations,
while not being deterministically related.

Table 3. Correlations of Main Categories of School Quality.

                                               (1)    (2)    (3)    (4)    (5)
Teachers and the Teaching Environment (1)       1
School Culture (2)                             .64     1
Resources (3)                                  .34    .20     1
Academic Learning (4)                          .29    .70    .66     1
Character and Well-Being (5)                   .66    .49    .18    .28     1

Table 4. Rankings, PPI (State Percentile in Parentheses), SQF Categories (z Score
in Parentheses), SQF Composite Score (Combined z Score in Parentheses).

School  Massachusetts  Teachers and the          School       Resources  Academic      Character and   SQF
        PPI            Teaching Environment (1)  Culture (2)  (3)        Learning (4)  Well-Being (5)  Composite
T       1 (89)         2 (0.22)                  1 (0.46)     1 (0.44)   1 (0.55)      5 (0.01)        1 (1.69)
U       2 (75)         5 (-0.05)                 5 (-0.03)    2 (0.31)   3 (0.11)      3 (0.07)        2 (0.41)
V       3 (73)         6 (-0.26)                 4 (0.046)    5 (-0.22)  2 (0.13)      4 (0.03)        6 (-0.26)
W       4 (53)         1 (0.28)                  2 (0.33)     6 (-0.25)  6 (-0.17)     1 (0.10)        3 (0.28)
X       5 (50)         7 (-0.30)                 7 (-0.39)    4 (-0.14)  7 (-0.33)     7 (-0.16)       7 (-1.31)
Y       6 (44)         4 (-0.04)                 3 (0.12)     7 (-0.27)  4 (-0.00)     6 (-0.01)       5 (-0.20)
Z       7 (31)         3 (0.14)                  6 (-0.25)    3 (0.09)   5 (-0.15)     2 (0.07)        4 (-0.11)

Note. PPI = Progress and Performance Index; SQF = school quality framework.
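
The bottom-up aggregation described in the text for the Resources category can be sketched as follows. This is a minimal illustration that assumes each level is formed by an unweighted mean of its children; the text says only that measures were "combined". The z scores are hypothetical, and the intermediate labels 3Aii, 3Ci, and 3Cii are inferred from the metric codes rather than stated in the text.

```python
from statistics import mean


def aggregate(node):
    """Score a node: a number is a metric; a dict is the mean of its children."""
    if isinstance(node, dict):
        return mean([aggregate(child) for child in node.values()])
    return float(node)


def leaves(node):
    """Yield every metric value in the tree, ignoring the hierarchy."""
    if isinstance(node, dict):
        for child in node.values():
            yield from leaves(child)
    else:
        yield float(node)


# Hypothetical z scores for one school, nested by the SQF coding scheme.
resources = {
    "3A Facilities and Personnel": {
        "3Aii": {"3Aiia": 0.10, "3Aiib": -0.30, "3Aiid": 0.05},
    },
    "3B Curricular Resources": {
        "3Bi Curricular Strength": {"3Bif": 0.20},
        "3Bii Class Size": {"3Biia": -0.40, "3Biib": 0.10},
    },
    "3C Community Support": {
        "3Ci": {"3Cia": 0.30},
        "3Cii": {"3Ciia": 0.10},
    },
}

nested_score = aggregate(resources)          # each level weighted evenly
flat_score = mean(list(leaves(resources)))   # every metric weighted evenly

print(f"bottom-up: {nested_score:.3f}, flat mean: {flat_score:.3f}")
```

The two scores differ because the flat mean lets a branch with many metrics dominate, whereas bottom-up averaging gives each submeasure, measure, and category equal weight at its own level.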

Relationships  Between  the  Alternative  Framework  Metrics  and 
the  State  System 

School rankings. One common criticism of existing data systems is that they
fail to measure many valued aspects of school quality. We now turn to our
first research question, initially by exploring how school rankings might
align and diverge under the two different frameworks. Specifically, we
sought to determine if the holistic picture of school quality is somehow being
captured by the current PPI, which is used by the Massachusetts Department
of Elementary and Secondary Education to classify schools into one of five
accountability and assistance levels. The PPI is a single number between 0
and 100, produced by combining test score results (specifically, progress
toward 100% proficiency, as well as average test score improvement measured
by the state’s Student Growth Percentile, or SGP) with graduation and drop-out
rates. It is possible, after all, that while the information may
not be presented in the PPI system, it is nevertheless accounted for. Table 4
lists the seven study schools in order of their PPI ranking. Presented alongside
this ranking scheme are six alternative rankings: one for each of the five
major categories of the alternative SQF model, and one for the composite, or
summative z score, of those five categories.
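
As a check on how the composite ranking relates to the category scores, the sketch below sums the five category z scores transcribed from Table 4 and ranks the schools on that sum. It assumes the "summative z score" is an unweighted sum; because the published values are rounded, the recomputed sums can differ from the printed composite by a few hundredths.

```python
# Category z scores (categories 1 to 5, in order) transcribed from Table 4.
category_z = {
    "T": [0.22, 0.46, 0.44, 0.55, 0.01],
    "U": [-0.05, -0.03, 0.31, 0.11, 0.07],
    "V": [-0.26, 0.046, -0.22, 0.13, 0.03],
    "W": [0.28, 0.33, -0.25, -0.17, 0.10],
    "X": [-0.30, -0.39, -0.14, -0.33, -0.16],
    "Y": [-0.04, 0.12, -0.27, -0.00, -0.01],
    "Z": [0.14, -0.25, 0.09, -0.15, 0.07],
}

# Assumed composite: unweighted sum of the five category z scores.
composite = {school: sum(z) for school, z in category_z.items()}

# Rank 1 goes to the highest composite score.
ordered = sorted(composite, key=composite.get, reverse=True)
sqf_rank = {school: position + 1 for position, school in enumerate(ordered)}

for school in ordered:
    print(school, sqf_rank[school], round(composite[school], 2))
```

The recomputed ordering matches the SQF Composite column of Table 4, which is consistent with the composite being a simple sum.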

We see some agreement in overall rankings between the two
frameworks. Two schools (T and U), for instance, maintain their relative place
when comparing the state PPI ranking with the alternative composite SQF
ranking. Two schools (W and Y) move one place, while one school (X) moves
two places when switching from the PPI model to the SQF model. Two schools
move three spots: School V, which is third according to the PPI, is sixth
in the composite SQF model, as it has relatively low scores in all categories
except Academic Learning; School Z jumps from seventh place to fourth. It
is worth noting here that some categories drive overall SQF rankings much
more than others. School Culture, for instance, exhibits a range of 0.85 SD
across the seven schools. By contrast, the highest and lowest scoring schools
in the Character and Well-Being category are separated by only 0.26 SD.
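
The rank movements and score ranges quoted above can be reproduced from Table 4. The sketch below also computes a Spearman rank correlation between the PPI and composite SQF rankings; that statistic is not reported in the study and is added here only as one hedged way to summarize overall agreement among seven schools.

```python
# Rankings transcribed from Table 4.
ppi_rank = {"T": 1, "U": 2, "V": 3, "W": 4, "X": 5, "Y": 6, "Z": 7}
sqf_rank = {"T": 1, "U": 2, "V": 6, "W": 3, "X": 7, "Y": 5, "Z": 4}

# Number of places each school moves between the two rankings.
shift = {s: abs(ppi_rank[s] - sqf_rank[s]) for s in ppi_rank}

# Spearman rank correlation (no ties): 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
n = len(ppi_rank)
d_squared = sum((ppi_rank[s] - sqf_rank[s]) ** 2 for s in ppi_rank)
spearman = 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Category z scores for schools T through Z, transcribed from Table 4.
culture_z = [0.46, -0.03, 0.046, 0.33, -0.39, 0.12, -0.25]
character_z = [0.01, 0.07, 0.03, 0.10, -0.16, -0.01, 0.07]
culture_range = max(culture_z) - min(culture_z)
character_range = max(character_z) - min(character_z)

print(shift)
print(round(spearman, 2), round(culture_range, 2), round(character_range, 2))
```

A category with a wider spread of z scores, such as School Culture, contributes more to differences in a summed composite than a category in which schools are tightly clustered.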

Perhaps the most important aspect to focus on here, however, is not the overall
congruence in rankings between frameworks but rather the variability
across individual measures that speaks to the multidimensionality of school
quality. As seen in Table 4, positional changes are more substantial within
individual categories than in the composite of all five. Schools W and X, for
instance, are quite similar in terms of PPI rankings, placing in the 53rd and
50th percentiles, respectively. Yet, differences abound. School W ranks first
in both Teachers and the Teaching Environment and in Character and Well-Being
outcomes; it also ranks second in School Culture. School X, by contrast,
ranks last in all three of those categories.

Overall, we find that school rankings according to the PPI and SQF models
exhibit some approximate alignment, with a few notable exceptions.
Perhaps the most important takeaway, however, is that examining only a
single index measure (either the PPI or the composite SQF) obscures the
fact that some schools perform well in some domains while doing relatively
poorly in others. Furthermore, the SQF provides a rich set of indicators which
schools might use to help guide policy and improvement efforts; the same
cannot be said of the PPI.

SQF metrics and PPI. We now explore our first research question more deeply,
reporting on correlations between individual metrics from the more comprehensive
SQF and the state system to understand what specific information is
not being captured by the PPI calculation (see Table 5). We find considerable
variability not only in the relationships between individual metrics and the
PPI but also between PPI and the average correlations of major categories of
the SQF. Moreover, the strengths of the relationships between various SQF
metrics and PPI offer suggestive evidence as to the ways in which PPI may
inadequately measure a fuller conception of school quality. In general, we
find that this relationship is stronger when a metric is related to
student achievement or family background, and weaker when it is more a
reflection of educational opportunity.

Metrics within the Teachers and the Teaching Environment category exhibit
an average correlation of .08 with the PPI. In addition, metrics within this
category have very different associations with the PPI. Three metrics within
Teachers and the Teaching Environment exhibited weak-to-moderate negative
correlations: teacher perceptions of the usefulness of professional development,
student perceptions of the level of teacher interest in students, and principal
instructional leadership. In other words, teachers in lower PPI schools exhibited
greater interest in students, as measured by student perception surveys, and also
found their professional development to be more useful. This serves as a powerful
instance of the ways in which a more holistic measure of school quality (and the
quality of the teacher environment, specifically) may capture important aspects
of the schooling enterprise which are not included in current measurement and
accountability frameworks. Put another way, relationships between test scores and
other school quality variables are not always strong, and specific strengths and
weaknesses may be hidden even if an aggregate relationship is positive.

Table 5. Pearson’s Correlation Coefficients Between SQF and the Massachusetts
PPI, Rate of ED Students.

Category / Metric                                      PPI percentile   % ED
                                                       rank             students
Teachers and the Teaching Environment
  Professional Preparation Scale (1Aic)                 .86             -.92
  Pedagogical Effectiveness Scale (1Aiia)               .51             -.71
  Interest in Students Scale (1Aiiia)                  -.36              .02
  Teacher Turnover (1Bia)                               .14             -.51
  PD Scale (1Biib)                                     -.47              .14
  Teacher Principal Trust Scale (1Biiia)                .06             -.42
  Principal Instructional Leadership Scale (1Biiib)    -.19             -.32
  Average Correlation for Category 1 Metrics            .08             -.39
School Culture
  Student Safety Scale (2Aib)                           .91             -.74
  Peer Victimization Scale (2Aiib)                      .73             -.92
  Peer Support Scale (2Aiic)                            .27             -.75
  Sense of Belonging Scale (2bia)                       .02             -.39
  Student Teacher Relationship Scale (2Biia)            .65             -.82
  Chronic Absences (2Cia)                               .29             -.62
  Academic Press Scale (2Ciia)                          .58             -.59
  Average Correlation for Category 2 Metrics            .49             -.69
Resources
  Art Classes per Student (3Aiia)                      -.02             -.05
  Counselor per Students (3Aiib)                        .74             -.53
  Support Staff Scale (3Aiid)                          -.05             -.34
  Curricular Strength Scale (3Bif)                      .31             -.80
  Class Size (3Biia)                                   -.40              .41
  Class Size Scale (3Biib)                              .28             -.04
  Parental Engagement Scale (3Cia)                      .71             -.76
  Community Engagement Scale (3Ciia)                    .46             -.58
  Average Correlation for Category 3 Metrics            .25             -.34
Academic Learning
  State SGP Score (4Aia)                                .61             -.33
  Student Achievement Scale (4Aiia)                     .77             -.99
  Student Engagement Scale (4Bia)                      -.22             -.07
  Valuing Learning Scale (4Biia)                       -.27             -.02
  Problem-Solving Scale (4Cia)                          .76             -.95
  Average Correlation for Category 4 Metrics            .33             -.47
Character and Well-Being
  Appreciation for Diversity Scale (5Aiia)              .76             -.43
  Grit Scale (5Bia)                                    -.32              .19
  Arts Exposure (5Cia)                                 -.10             -.25
  Positive Affect Scale (5Dia)                          .00             -.36
  Physical Activity (5Diia)                            -.08             -.53
  Average Correlation for Category 5 Metrics            .05             -.28
Average correlation for all metrics                     .25             -.44

Note. PPI = Progress and Performance Index; SQF = school quality framework;
ED = economically disadvantaged.

We find that School Culture metrics exhibited the strongest relationship to
PPI, with an average correlation of moderate magnitude (p = .49). Given the
robust association between school climate and achievement (Thapa, Cohen,
Guffey, & Higgins-D’Alessandro, 2013), such a finding was anticipated. The
seven individual metrics that form School Culture, taken from the teacher
survey, student survey, and district administrative data, all exhibit positive
associations with state PPI, although the magnitude of these correlation
coefficients varies widely. This suggests that even in the case where broader
constructs exhibit a strong connection to the PPI, the metrics composing the
broader construct are likely to behave differently.
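
The category averages in Table 5 are plain means of the metric-level coefficients. The sketch below reproduces the School Culture row from the values transcribed from that table and shows the spread that the average hides; the variable names are illustrative only.

```python
from statistics import mean

# (correlation with PPI percentile rank, correlation with % ED students),
# transcribed from the School Culture block of Table 5.
school_culture = {
    "Student Safety Scale (2Aib)": (0.91, -0.74),
    "Peer Victimization Scale (2Aiib)": (0.73, -0.92),
    "Peer Support Scale (2Aiic)": (0.27, -0.75),
    "Sense of Belonging Scale (2bia)": (0.02, -0.39),
    "Student Teacher Relationship Scale (2Biia)": (0.65, -0.82),
    "Chronic Absences (2Cia)": (0.29, -0.62),
    "Academic Press Scale (2Ciia)": (0.58, -0.59),
}

ppi_corrs = [ppi for ppi, _ in school_culture.values()]
ed_corrs = [ed for _, ed in school_culture.values()]

avg_ppi = mean(ppi_corrs)
avg_ed = mean(ed_corrs)
ppi_spread = (min(ppi_corrs), max(ppi_corrs))

print(f"average r with PPI: {avg_ppi:.2f}")
print(f"average r with % ED: {avg_ed:.2f}")
print(f"range of r with PPI: {ppi_spread[0]:.2f} to {ppi_spread[1]:.2f}")
```

Averaging correlation coefficients directly is a descriptive convenience; with seven schools, each individual coefficient carries substantial sampling uncertainty.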

On average, the metrics forming the Resources category displayed weak-to-moderate
positive correlations (p = .25) with PPI. Within Resources, the parental
engagement and community engagement scales, both drawn from the
teacher survey, exhibited strong and moderate correlations, respectively, with
the state PPI calculation. This might not be surprising, given that student
achievement is strongly influenced by home and community effects that might
be reflected in these engagement scales. A number of metrics within the control
of the school, however, had negative correlations with PPI; these metrics were
art classes per student, the support staff scale, and class size. The near-zero
relationship between PPI and art classes per student is to be expected, given
that PPI likely does not capture the benefits of arts education. These findings
suggest that Resources, which generally reflect OTL concepts more than
outcome-based indicators of school quality, are not captured very well by the PPI.

Academic Learning metrics and PPI were also weakly to moderately correlated
(p = .33). However, when one looks at metric-level correlations within
the Academic Learning category, a compelling trend emerges. Three metrics
were strongly correlated with PPI. The first of those, the state SGP, is one of
four components of the PPI. Consequently, the strong correlation between the
two is to be expected. Similarly, the Student Achievement Scale (a teacher
survey measure that captures perceptions about student work ethic and
performance) and the Problem-Solving Scale (measuring teacher perceptions
of student higher order thinking skills) also correlate highly with PPI.
However, two metrics exhibited a negative correlation with PPI: the student
engagement and valuing learning scales, both of which seek to measure student
connectedness to learning. One may view these latter two metrics as
reflecting OTL, whereas the former three metrics better capture the level
of student performance. Thus, this finding suggests that PPI may be capturing
student performance without doing a particularly good job of representing
the extent to which teachers provide students with a chance to learn by getting
students engaged in the process of their own learning. In fact, schools
that perform better on student achievement might be damaging the intrinsic
value of learning in the process.

Metrics within the Character and Well-Being category, on average, exhibit
near-zero correlation (p = .05) with the PPI. This is largely due to one metric,
appreciation for diversity, having a strong positive correlation, and the
remaining four metrics exhibiting zero or negative correlations with PPI. Somewhat
surprisingly, schools in the district exhibited very little variability on metrics
within the Character and Well-Being category. In other words, while schools
differ dramatically according to the PPI, they look remarkably similar when
comparisons are made using SEL indicators. This may be due to the fact that
the district exerts a stronger influence than the school in this domain, or it may
be due to measurement error. Whatever the case, though, it seems to call for
more thorough investigation.

Relationships  Between  the  Alternative  Framework  Metrics  and 
School  Poverty 

One of the strongest criticisms of existing data systems is that they
unintentionally capture out-of-school factors tied to student demography. In other
words, rather than measuring schools, they are measuring families and
neighborhoods. Here, we answer our second research question by examining
whether SQF metrics relate as strongly to school poverty as does the existing
state PPI score, which is heavily reliant on raw standardized test scores.

The Massachusetts PPI calculation exhibits a very strong negative correlation
(p = -.80) to the percentage of ED students in a school, while the relationships
between SQF metrics and percent ED vary in magnitude (see Table 5). Overall,
the average correlation for all metrics (p = -.44) is roughly half as large as the
correlation between PPI and ED rates. In three major categories (Teachers and
the Teaching Environment, Resources, and Character and Well-Being) we see,
on average, moderate negative correlations to ED rates. While several metrics
within these categories are, in fact, tightly tied to school poverty (for example,
Professional Preparation, p = -.92; Curricular Strength, p = -.80; Parental
Engagement, p = -.76), most exhibit more moderate correlations. In fact, five
of the 20 metrics from these categories (Interest in Students, PD Scale, Art
Classes per Student, Class Size Scale, Grit Scale) exhibit near-zero or slightly
positive correlations, while Class Size has a positive correlation with ED rates of
moderate magnitude (p = .41), representing an important investment made by the
district into its poorest schools. Unsurprisingly, each of these five metrics reflects
OTL or SEL themes more so than absolute student academic performance.
Conversely, we find School Culture metrics to be consistently and strongly tied
to school poverty.
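
The "roughly half" comparison and the count of weakly related metrics can be checked against Table 5. In the sketch below, the cutoff of -.10 for "near-zero or positive" is an illustrative assumption, not a threshold used in the study; the coefficients are transcribed from the % ED students column for Categories 1, 3, and 5.

```python
ppi_vs_ed = -0.80       # correlation between PPI and % ED students (from the text)
sqf_avg_vs_ed = -0.44   # average correlation across all SQF metrics (Table 5)
ratio = sqf_avg_vs_ed / ppi_vs_ed

# Correlations with % ED students for the 20 metrics in Categories 1, 3, and 5.
ed_corr = {
    "Professional Preparation": -0.92, "Pedagogical Effectiveness": -0.71,
    "Interest in Students": 0.02, "Teacher Turnover": -0.51,
    "PD Scale": 0.14, "Teacher Principal Trust": -0.42,
    "Principal Instructional Leadership": -0.32,
    "Art Classes per Student": -0.05, "Counselor per Students": -0.53,
    "Support Staff": -0.34, "Curricular Strength": -0.80,
    "Class Size": 0.41, "Class Size Scale": -0.04,
    "Parental Engagement": -0.76, "Community Engagement": -0.58,
    "Appreciation for Diversity": -0.43, "Grit": 0.19,
    "Arts Exposure": -0.25, "Positive Affect": -0.36,
    "Physical Activity": -0.53,
}

CUTOFF = -0.10  # assumed boundary for "near-zero or positive"
weak_or_positive = sorted(name for name, r in ed_corr.items() if r > CUTOFF)

print(f"SQF average is {ratio:.2f} of the PPI correlation with % ED")
print(weak_or_positive)
```

Under this cutoff the list contains the five metrics named in the text plus Class Size, which the text discusses separately because its positive correlation is moderate rather than slight.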

The most interesting trends to emerge from this particular analysis are seen
in the correlation coefficients between ED rates and Academic Learning measures.
Two metrics, the Student Achievement Scale and the Problem-Solving
Scale, exhibit near-deterministic relationships with the percentage of ED
students in a school, with correlation coefficients of -.99 and -.95, respectively.
The state SGP score, which, roughly speaking, measures achievement growth
and not absolute achievement, still exhibits a negative correlation of moderate
magnitude (p = -.33). However, two metrics within Academic Learning (Student
Engagement Scale, Valuing Learning Scale) have essentially no relationship to
school poverty. Given that these latter two variables are closer representations
of opportunity to learn (student perceptions of their engagement in class,
teacher perceptions of whether students value learning), and further from the
more absolute learning metrics that comprise this category (teacher perceptions
of student ability to achieve, problem solve), this provides a compelling example
of how a holistic accountability system provides a more complete picture of
quality for those schools serving vulnerable students.

Discussion 

Measurement systems shape school priorities, inform policy, and affect
parental behavior. They also constitute the basis for accountability structures.
This study examines how the current system used to identify school quality
in Massachusetts, the PPI, compares with a comprehensive alternative system
that may prefigure accountability systems of the future. Specifically, we
examine what the state model fails to capture, as well as what it captures but
should not. We find that the PPI calculation does roughly align with a more
comprehensive framework. However, the PPI system suffers from several
weaknesses previously identified by scholars, educators, and the public: It
offers only summative information that cannot be used for school improvement,
it fails to capture information about OTL and social-emotional
learning, and it strongly reflects school demographics. By contrast, the SQF
model, as an example of what accountability systems of the future might
look like, is less prone to these particular shortcomings. Moreover, whereas
the current system of accountability is run by the state, a system like that of
SQF, one which is more responsive to the values and concerns of local
stakeholder groups, would be more likely to empower the community to
drive meaningful school reform. This is especially the case if data are made
easy to access and interpret.

This study may strengthen the hands of those who have identified weaknesses
in current approaches to measurement and accountability. For, while
there is much agreement that current systems are inadequate, we know
relatively little about the degree to which an alternate system would offer an
improvement. As evidence from this study appears to indicate, a model that
includes a broader range of metrics represents a significant step forward.

What Is Not Measured by Current Systems

One important theme to emerge from this work is that the current accountability
framework in Massachusetts, typical of such frameworks in other
states, does not capture all of the elements of school quality that stakeholders
deem to be important. Although we find that PPI is positively related to
each of the five measures we use to define school quality, this relationship is
quite weak in some cases, and numerous important metrics are actually
inversely related to PPI. Even when looking only at the metrics which comprise
the Academic Learning category of the alternative SQF model, we see
a range of associations which might indicate that PPI fails to capture certain
components of academic achievement. There are a number of important practical
and political implications that follow from this.

School Rankings. This work documents the multifaceted nature of school
quality, and reveals some of the nuance lost when school quality is presented
as unidimensional. In cases where ranking must be done to identify
low-performing schools, it is important to note three things: First, identified
schools may have different strengths and weaknesses. Second, some domains
of school quality may differentiate schools quite well, while others may be
relatively consistent across schools. Third, regardless of the methodology
used to rank schools, one should acknowledge that rankings are highly dependent
on the metrics that are chosen for inclusion, and that such choices
involve a subjective component.

Policymakers, of course, do not have to rank schools. But if they are going
to, such a high-stakes practice demands a more complete accounting of
school quality.

Gaming. A second policy implication relates to the unintended consequences
of measurement systems. Critics argue that traditional accountability systems,
being heavily reliant on a single measure, may promote gaming. That
is, they may encourage schools to improve their scores on a performance
indicator without actually improving their overall performance. So, while it
may sometimes be the case that rising test scores indicate gains in student
learning, it might also be the case that rising scores indicate a narrowing of
the curriculum or an increased emphasis on test-preparation techniques,
practices that would accomplish the same end by different (and problematic)
means. Consequently, it appears to be in the best interest of students to create
a system that is harder to game.

As a holistic system has far more indicators, and its constituent metrics
are not deterministically related to each other, it appears less likely that
such gaming behaviors would develop. That said, the strongest incentive to
game performance indicators may be high-stakes accountability itself,
which places tremendous pressure on schools to achieve measurable results.
Insofar as that is the case, policy leaders may wish to revisit not only the
performance measures they include in accountability systems but also the
stakes attached to those systems.

Actionable data. One of the goals of any accountability system should be to
provide useful information to stakeholders. Given that individuals may
differentially value particular aspects of school quality, it seems important to
report on multiple measures. A more comprehensive school quality measure
is more likely to align with the interests of the public and to provide them
with actionable information. It is also likely to provide schools with more
information. When schools are able to see how they compare with each other
along a range of metrics, as opposed to merely achievement, leaders are more
apt to make judgments based on such data.

What  Is  (But  Should  Not  Be)  Measured  by  Current  Systems 

In addition to failing to measure certain aspects of schooling, current
accountability systems indirectly measure family and neighborhood characteristics.
Although demography is not destiny and though schools with
similar poverty levels do vary in their ability to improve student outcomes,
the relationship between school poverty and accountability ratings is quite
strong. Two important implications flow from this finding.

Perceptions of school quality. Current accountability systems are highly reliant
on test scores. Insofar as perceptions of school quality are shaped by such
systems, then, they will be strongly influenced by student poverty. Perceptions
of school quality matter enormously, likely driving a subset of teachers
and parents toward higher achieving schools and away from those identified
as struggling. It seems quite possible that such a Matthew Effect would
increase inequality in public education, with systems of accountability
perversely hurting the very schools they were established to help.

We find variability in the relationship between SQF metrics and school
poverty, and observe that, on average, the magnitude of this correlation is
roughly half as large as that between the state accountability system and school
poverty. The inclusion of a more comprehensive set of indicators, it seems
(particularly if they were less tightly coupled with socioeconomic variables), might
help highlight the ways in which schools serving historically marginalized
groups are, in many cases, doing rather well.

Capacity building. Under an alternative accountability system, the relative
strengths and weaknesses of schools are more likely to emerge. Indeed, this
analysis reveals important areas where schools with low test scores perform
roughly on par with (or better than) higher scoring schools. This is not to say
that we should be sanguine about low test scores. Although test scores are
limited indicators of school quality, they do indicate something about basic
literacy and numeracy, and by extension, about a core function of public
education. For the past two decades, however, low test scores have been viewed
as the sign of a failing school and have served as the basis for sanctioning the
schools most in need of assistance. A more holistic framework, which more
clearly identifies strengths and weaknesses, and which identifies inputs
alongside outputs, may help shift accountability systems away from punishing
schools and toward capacity building. If it is possible to determine where
and why a school is weak, and if it is clear that the school is not uniformly
underperforming, it may seem less reasonable to label it failing or to slate it
for closure.

It is worth noting here that, although schools with low test scores have
been more affected than those with relatively high scores, a more holistic
measurement system might benefit both groups of schools. As we find in this
study, schools are not uniformly good or bad. Thus, schools with high
standardized test scores may have areas of relative weakness that have been
overlooked and therefore unaddressed. Seeing schools with more nuance, then,
may lead to more emphasis on capacity building, whether student standardized
test scores are high or low.

Limitations 

The  data  used  in  this  study  come  from  the  initial  pilot  year  of  an  initiative  to 
create  a  more  holistic  measure  of  school  quality,  using  a  new  framework  that 
may  be  refined  and  expanded  upon  in  subsequent  years.  There  are  clearly 
numerous  concerns  that  must  be  evaluated  before  any  new  metric  should  be 
used in a high-stakes accountability system, including the benefits and drawbacks
of expanding school quality measures, as well as possible unintended
consequences of doing so. In addition, no alternative accountability framework
will reflect stakeholder values perfectly, even one like the SQF, which
was  developed  in  collaboration  with  local  stakeholders.  This  study  also 
examines only seven schools in a single district, all of which are subject to a
single state's accountability scheme. Moreover, this sample is restricted to traditional
primary/middle schools, and only students in Grades 4 and above
were surveyed; we do not include any early education centers, high schools,
or alternative schools in our analyses. Overall, then, one should be very
cautious about generalizing the findings presented here to other schools. That
said,  given  the  dearth  of  research  on  this  topic,  as  well  as  largely  similar 
incarnations  of  accountability  across  states  in  the  nation,  the  findings  here  are 
seemingly  germane  to  a  wide  audience. 


Conclusion 

The recent authorization of ESSA will likely spur the inclusion of additional
school quality metrics in measurement and accountability systems,
most likely in the form of opportunities to learn and socioemotional learning.
The findings in this study support the continued exploration of a
more  holistic  measure  of  school  quality.  Current  accountability  systems 
measure  too  little  about  schools,  and  too  much  about  families  and 
neighborhoods. 

Insofar as accountability systems seek to encourage efficient and effective
use of resources, it seems they have much to gain from the kinds of
improvements described here. But we must recall that accountability systems
in education are also intended to promote equity for our most vulnerable
students, who deserve a fair and adequate education. For this purpose,
current measurement and accountability systems appear even less up to the
task. By stigmatizing and sanctioning low-achieving schools without
understanding  how  well  such  schools  perform  across  their  full  mission,  we 
exacerbate  inequality  of  opportunity.  Those  harmed,  as  a  result,  are  those 
most  in  need  of  our  care. 

The  accountability  system  of  the  future,  if  it  looks  like  what  we  imagine, 
will not be perfect. But it would represent a significant improvement.
Policymakers should take seriously the challenge of moving forward and
revising existing measurement and accountability systems. And, as they do,
they  should  remember  that  an  even  more  perfect  system  lies  even  further 
ahead.  Beyond  each  mountain,  another  mountain. 

Appendix 

School  Quality  Framework  (SQF) 

Essential  inputs 

1. Teachers and the Teaching Environment
1A. Knowledge and Skills of Teachers
1Aic. Professional Preparation Scale
1Aiia. Pedagogical Effectiveness Scale
1Aiiia. Interest in Students Scale
1B. Teaching Environment
1Bia. Teacher Turnover
1Biib. Professional Development Scale
1Biiia. Teacher Principal Trust Scale
1Biiib. Principal Instructional Leadership Scale

2.  School  Culture 

2A.  Safety 

2Aib.  Student  Safety  Scale 
2Aiib.  Peer  Victimization  Scale 
2Aiic.  Peer  Support  Scale 
2B.  Relationships 
2Bia.  Sense  of  Belonging  Scale 
2Biia.  Student  Teacher  Relationship  Scale 
2C.  Academic  Orientation 
2Cia.  Chronic  Absences 
2Ciia.  Academic  Press  Scale 

3.  Resources 

3A. Facilities and Personnel
3Aiia. Art Classes per Student
3Aiib. Counselors per Students
3Aiid. Support Staff Scale
3B.  Curricular  Resources 
3Bif.  Curricular  Strength  Scale 
3Biia.  Class  Size 
3Biib.  Class  Size  Scale 
3C.  Community  Support 
3Cia.  Parental  Engagement  Scale 
3Ciia.  Community  Engagement  Scale 

Key  outcomes 

4. Indicators of Academic Learning

4A. Performance
4Aia.  State  SGP  Score 
4Aiia.  Student  Achievement  Scale 
4B.  Student  Commitment  to  Learning 
4Bia.  Student  Engagement  Scale 
4Biia.  Valuing  Learning  Scale 
4C.  Critical  Thinking 
4Cia.  Problem-Solving  Scale 
4D. College and Career Readiness

5. Character and Well-Being Outcomes
5A.  Civic  Engagement 
5Aiia.  Appreciation  for  Diversity  Scale 
5B.  Work  Ethic 
5Bia.  Grit  Scale 
5C.  Artistic  and  Creative  Traits 
5Cia.  Arts  Exposure 
5D.  Health 

5Dia.  Positive  Affect  Scale 
5Diia. Physical Activity

Notes

1. Although steps are being taken to produce criterion-based measures of school
quality for this district, our lens in this study is norm-referenced. Thus, both
the  school  quality  framework  used  here  and  the  state  Progress  and  Performance 
Index  (PPI)  compare  schools  with  each  other.  Given  the  nature  of  this  project,  we 
are  limited  to  comparisons  between  schools  within  the  district. 

2. For more information on PPI, see http://profiles.doe.mass.edu/accountability/report/aboutdata.aspx